Abstract: Statistics about n-grams (i.e., sequences of contiguous
words or other tokens in text documents or other string
data) are an important building block in information retrieval 
and natural language processing. In this work, we study how 
n-gram statistics, optionally restricted by a maximum n-gram 
length and minimum collection frequency, can be computed 
efficiently harnessing MapReduce for distributed data processing. 
We describe different algorithms, ranging from an extension of 
word counting, via methods based on the Apriori principle, to 
a novel method Suffix-σ that relies on sorting and aggregating
suffixes. We examine possible extensions of our method to support 
the notions of maximality/closedness and to perform aggregations 
beyond occurrence counting. Assuming Hadoop as a concrete 
MapReduce implementation, we provide insights on an efficient 
implementation of the methods. Extensive experiments on The 
New York Times Annotated Corpus and ClueWeb09 expose the 
relative benefits and trade-offs of the methods. 

I. Introduction 

Applications in various fields including information re- 
trieval [12], [46] and natural language processing [13], [18], 
[39] rely on statistics about n-grams (i.e., sequences of con- 
tiguous words in text documents or other string data) as an 
important building block. Google and Microsoft have made 
available n-gram statistics computed on parts of the Web. 
While certainly a valuable resource, one limitation of these 
datasets is that they only consider n-grams consisting of up 
to five words. With this limitation, there is no way to capture 
idioms, quotations, poetry, lyrics, and other types of named 
entities (e.g., products, books, songs, or movies) that typically 
consist of more than five words and are crucial to applications 
including plagiarism detection, opinion mining, and social 
media analytics. 

MapReduce has gained popularity in recent years both as 
a programming model and in its open-source implementation 
Hadoop. It provides a platform for distributed data processing, 
for instance, on web-scale document collections. MapReduce 
imposes a rigid programming model, but treats its users with 
features such as handling of node failures and an automatic 
distribution of the computation. To make most effective use 
of it, problems need to be cast into its programming model, 
taking into account its particularities. 

In this work, we address the problem of efficiently com- 
puting n-gram statistics on MapReduce platforms. We allow 
for a restriction of the n-gram statistics to be computed by 
a maximum length σ and a minimum collection frequency τ.
Only n-grams consisting of up to σ words and occurring at
least τ times in the document collection are thus considered.

While this can be seen as a special case of frequent sequence 
mining, our experiments on two real-world datasets show that 
MapReduce adaptations of Apriori-based methods [38], [44]
do not perform well, in particular when long and/or less
frequent n-grams are of interest. In this light, we develop our
novel method Suffix-σ that is based on ideas from string
processing. Our method makes thoughtful use of MapReduce's 
grouping and sorting functionality. It keeps the number of 
records that have to be sorted by MapReduce low and exploits 
their order to achieve a compact main-memory footprint, when 
determining collection frequencies of all n-grams considered. 

We also describe possible extensions of our method. This in- 
cludes the notions of maximality/closedness, known from fre- 
quent sequence mining, that can drastically reduce the amount 
of n-gram statistics computed. In addition, we investigate 
to what extent our method can support aggregations beyond 
occurrence counting, using n-gram time series, recently made
popular by Michel et al. [32], as an example. 

Contributions made in this work include: 

• a novel method Suffix-σ to compute n-gram statistics
that has been specifically designed for MapReduce; 

• a detailed account of efficient implementation and possible
extensions of Suffix-σ (e.g., to consider maximal/closed
n-grams or support other aggregations);

• a comprehensive experimental evaluation on The New 
York Times Annotated Corpus (1.8 million news articles 
from 1987-2007) and ClueWeb09-B (50 million web 
pages crawled in 2009), as two large-scale real-world 
datasets, comparing our method against state-of-the-art 
competitors and investigating their trade-offs. 

Suffix-σ outperforms its best competitor in our experiments
by up to a factor of 12 when long and/or less frequent
n-grams are of interest. Otherwise, it performs at least on par 
with the best competitor. 

Organization. Section II introduces our model. Section III
details methods to compute n-gram statistics based on prior
ideas. Section IV introduces our method Suffix-σ. Aspects of
efficient implementation are addressed in Section V. Possible
extensions of Suffix-σ are sketched in Section VI. Our
experiments are the subject of Section VII. In Section VIII, we
put our work into context, before concluding in Section IX.



II. Preliminaries 

We now introduce our model, establish our notation, and 
provide some technical background on MapReduce. 

A. Data Model 

Our methods operate on sequences of terms (i.e., words 
or other textual tokens) drawn from a vocabulary V. We let S 
denote the universe of all sequences over V. Given a sequence 
$s = \langle s_0, \ldots, s_{n-1} \rangle$ with $s_i \in V$, we refer to its length as $|s|$,
write $s[i..j]$ for the subsequence $\langle s_i, \ldots, s_j \rangle$, and let $s[i]$ refer
to the element $s_i$. For two sequences r and s, we let $r \| s$ denote
their concatenation. We say that

• r is a prefix of s ($r \triangleright s$) iff

$\forall\, 0 \leq i < |r| : r[i] = s[i]$

• r is a suffix of s ($r \triangleleft s$) iff

$\forall\, 0 \leq i < |r| : r[i] = s[|s| - |r| + i]$

• r is a subsequence of s ($r \sqsubseteq s$) iff

$\exists\, 0 \leq j < |s| : \forall\, 0 \leq i < |r| : r[i] = s[i + j]$

and capture how often r occurs in s as

$f(r, s) = |\{\, 0 \leq j < |s| \mid \forall\, 0 \leq i < |r| : r[i] = s[i + j] \,\}|$.

To avoid confusion, we use the following convention: When 
referring to sequences of terms having a specific length k, we 
will use the notion k-gram or indicate the considered length
by alluding to, for instance, 5-grams. The notion n-gram, as 
found in the title, will be used when referring to variable- 
length sequences of terms. 

As an input, all methods considered in this work receive 
a document collection $\mathcal{D}$ consisting of sequences of terms
as documents. Our focus is on determining how often n-grams
occur in the document collection. Formally, the collection
frequency of an n-gram s is defined as
$cf(s) = \sum_{d \in \mathcal{D}} f(s, d)$. Alternatively, one could consider the document
frequency of n-grams as the total number of documents
that contain a specific n-gram. While this corresponds to the 
notion of support typically used in frequent sequence mining, 
it is less common for natural language applications. However, 
all methods presented below can easily be modified to produce 
document frequencies instead. 
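
As an illustration only, the definitions of this subsection can be written down in a few lines of Python. This is a minimal sketch: the function names and the three toy documents are our own assumptions and do not stem from the implementation evaluated in this work. It contrasts collection frequency (all occurrences) with document frequency (documents containing the n-gram).

```python
from typing import Sequence


def is_prefix(r: Sequence, s: Sequence) -> bool:
    """r is a prefix of s: r[i] == s[i] for all 0 <= i < |r|."""
    return len(r) <= len(s) and list(s[:len(r)]) == list(r)


def is_suffix(r: Sequence, s: Sequence) -> bool:
    """r is a suffix of s: r[i] == s[|s| - |r| + i] for all 0 <= i < |r|."""
    return len(r) <= len(s) and list(s[len(s) - len(r):]) == list(r)


def occurrences(r: Sequence, s: Sequence) -> int:
    """f(r, s): number of offsets j at which r occurs contiguously in s."""
    r, s = list(r), list(s)
    return sum(1 for j in range(len(s) - len(r) + 1) if s[j:j + len(r)] == r)


def collection_frequency(r: Sequence, docs) -> int:
    """cf(r): total number of occurrences of r over all documents."""
    return sum(occurrences(r, d) for d in docs)


def document_frequency(r: Sequence, docs) -> int:
    """df(r): number of documents that contain r at least once."""
    return sum(1 for d in docs if occurrences(r, d) > 0)


# Three toy documents (an assumption made for this sketch).
docs = [list("axbxx"), list("baxbx"), list("xbaxb")]
print(collection_frequency(["x", "b"], docs))
print(document_frequency(["x", "b"], docs))
```

The two printed values differ whenever an n-gram occurs more than once within a single document, which is exactly the case that separates the two notions.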

B. MapReduce 

MapReduce, as described by Dean and Ghemawat [17], 
is a programming model and an associated runtime sys- 
tem at Google. While originally proprietary, the MapReduce 
programming model has been widely adopted in practice 
and several implementations exist. In this work, we rely on 
Hadoop [1] as a popular open-source MapReduce platform. 
The objective of MapReduce is to facilitate distributed data 
processing on large-scale clusters of commodity computers. 
MapReduce enforces a functional style of programming and 
lets users express their tasks as two functions 

map() : (k1, v1) -> list<(k2, v2)>

reduce() : (k2, list<v2>) -> list<(k3, v3)>



that consume and emit key-value pairs. Between the map-
and reduce-phase, the system sorts and groups the key-value
pairs emitted by the map-function. The partitioning of
key-value pairs (i.e., how they are assigned to cluster nodes)
and their sort order (i.e., in which order they are seen by the
reduce-function on each cluster node) can be customized,
if needed for the task at hand. For detailed introductions to 
working with MapReduce and Hadoop, we refer to Lin and 
Dyer [29] as well as White [41]. 
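
To make the programming model concrete, the following sketch simulates the map, sort-and-group, and reduce steps inside a single Python process, using word counting as the task. It is a didactic stand-in under our own assumptions (the names run_mapreduce, wc_map, and wc_reduce are hypothetical) and does not use or mimic the Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, sort_key=None):
    """Apply map_fn to every record, group emitted values by key,
    visit the keys in sorted order, and apply reduce_fn to each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups, key=sort_key):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def wc_map(did, doc):
    """Emit (term, 1) for every term occurrence in the document."""
    for term in doc:
        yield term, 1


def wc_reduce(term, ones):
    """Sum up the ones to obtain the collection frequency of the term."""
    yield term, sum(ones)


# Two toy documents keyed by a document identifier (assumed for this sketch).
docs = [(1, ["a", "x", "b", "x", "x"]), (2, ["b", "a", "x", "b", "x"])]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)
print(word_counts)
```

The optional sort_key parameter plays the role of a custom sort order; a real platform additionally partitions the keys across cluster nodes, which this single-process sketch omits.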

III. Methods based on prior ideas 

With our notation established, we next describe three meth- 
ods based on prior ideas to compute n-gram statistics in 
MapReduce. Before delving into their details, let us state the 
problem that we address in more formal terms: 

Given a document collection $\mathcal{D}$, a minimum collection
frequency τ, and a maximum length σ, our objective is to identify
all n-grams s with their collection frequency $cf(s)$, for which
$cf(s) \geq \tau$ and $|s| \leq \sigma$ hold.

We thus assume that n-grams are only of interest to the
task at hand if they occur at least τ times in the document
collection, coined frequent in the following, and consist of at
most σ terms. Consider, as an example task, the construction
of n-gram language models [46], for which one would only
look at n-grams up to a specific length and/or resort to back-off
models [24] to obtain more robust estimates for n-grams
that occur fewer than a specific number of times.

The problem statement above can be seen as a special case 
of frequent sequence mining that considers only contiguous 
sequences of single-element itemsets. We believe this to be an 
important special case that warrants individual attention and 
allows for an efficient solution in MapReduce, as we show in 
this work. A more elaborate comparison to existing research
on frequent sequence mining is part of Section VIII.

To ease our explanations below, we use the following 
running example, considering a collection of three documents: 

d1 = ⟨a x b x x⟩
d2 = ⟨b a x b x⟩
d3 = ⟨x b a x b⟩

With parameters τ = 3 and σ = 3, we expect as output

⟨a⟩ : 3      ⟨b⟩ : 5      ⟨x⟩ : 7
⟨a x⟩ : 3    ⟨x b⟩ : 4
⟨a x b⟩ : 3

from any method, when applied to this document collection.

A. Naive Counting 

One of the example applications of MapReduce, given by 
Dean and Ghemawat [17] and also used in many tutorials, is 
word counting, i.e., determining the collection frequency of 
every word in the document collection. It is straightforward 
to adapt word counting to consider variable-length n-grams
instead of only unigrams and discard those that occur fewer than
τ times. Pseudo code of this method, which we coin Naive,
is given in Algorithm 1.



Algorithm 1: Naive

// Mapper
map(long did, seq d) begin
    for b = 0 to |d| - 1 do
        for e = b to min(b + σ - 1, |d| - 1) do
            emit(seq d[b..e], long did)

// Reducer
reduce(seq s, list<long> l) begin
    if |l| ≥ τ then
        emit(seq s, int |l|)



In the map-function, the method emits all n-grams of length
up to σ for a document together with the document identifier.
If an n-gram occurs more than once, it is emitted multiple
times. In the reduce-phase, the collection frequency of every
n-gram is determined and, if it is at least τ, emitted together
with the n-gram itself.
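
The following Python sketch mirrors Naive in a single process and additionally counts how many key-value pairs the map-phase emits. It is meant as an illustration under our own assumptions: the dictionary of lists stands in for the shuffle of a real MapReduce platform, and the function names are hypothetical.

```python
from collections import defaultdict


def naive_map(did, d, sigma):
    """Emit every n-gram of length <= sigma that starts at some position b."""
    for b in range(len(d)):
        for e in range(b, min(b + sigma - 1, len(d) - 1) + 1):
            yield tuple(d[b:e + 1]), did


def naive_reduce(s, dids, tau):
    """Emit the n-gram with its collection frequency if it is frequent."""
    if len(dids) >= tau:
        yield s, len(dids)


def naive(docs, tau, sigma):
    """Run Naive over docs (did -> sequence); return result and #emitted pairs."""
    groups = defaultdict(list)  # stands in for sorting and grouping by key
    emitted = 0
    for did, d in docs.items():
        for s, v in naive_map(did, d, sigma):
            groups[s].append(v)
            emitted += 1
    result = {}
    for s, dids in groups.items():
        for key, cf in naive_reduce(s, dids, tau):
            result[key] = cf
    return result, emitted


docs = {1: "axbxx", 2: "baxbx", 3: "xbaxb"}  # the running example
result, emitted = naive(docs, tau=3, sigma=3)
print(result)
print(emitted)
```

Keeping the emitted-pair counter separate from the result makes it easy to compare the amount of transferred data against the other methods on the same input.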

Interestingly, apart from minor optimizations, this is the 
method that Brants et al. [13] used for training large-scale 
language models at Google, considering n-grams up to length 
five. In practice, several tweaks can be applied to improve this 
simple method including local pre-aggregation in the map- 
phase (e.g., using a combiner in Hadoop). Implementation 
details of this kind are covered in more detail in Section V. The
potentially vast number of emitted key-value pairs that needs 
to be transferred and sorted, though, remains a shortcoming. 

In the worst case, when $\sigma \geq |d|$, Naive emits $O(|d|^2)$ key-value
pairs for a document d, each consuming $O(|d|)$ bytes,
so that the method transfers $O(|d|^3)$ bytes between the map-
and reduce-phase. Complementary to that, we can determine
the number of key-value pairs emitted based on the n-gram
statistics. Naive emits a total of $\sum_{s \in S : |s| \leq \sigma} cf(s)$ key-value
pairs, each of which consumes $O(|s|)$ bytes.
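
The quadratic bound can be checked with a short calculation, given here as a worked sketch. A suffix starting at position b contributes min(σ, |d| - b) n-grams; the second line instantiates the count for the running example, where every document has length 5 and σ = 3.

```latex
% Number of key-value pairs Naive emits for one document d
\sum_{b=0}^{|d|-1} \min(\sigma, |d| - b)
  \;=\; \sum_{b=0}^{|d|-1} (|d| - b)
  \;=\; \frac{|d|\,(|d| + 1)}{2}
  \;\in\; O(|d|^2)
  \qquad \text{if } \sigma \geq |d| .

% Running example: three documents with |d| = 5 and \sigma = 3
3 \cdot (3 + 3 + 3 + 2 + 1) \;=\; 3 \cdot 12 \;=\; 36 .
```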

B. Apriori-Based Methods 

How can one do better than the naive method just outlined? 
One idea is to exploit the APRIORI principle, as described 
by Agrawal et al. [9] in their seminal paper on identifying 
frequent itemsets and follow-up work on frequent pattern min- 
ing [10], [37], [38], [44]. Cast into our setting, the APRIORI 
principle states that 

$r \sqsubseteq s \Rightarrow cf(r) \geq cf(s)$

holds for any two sequences r and s, i.e., the collection 
frequency of a sequence r is an upper bound for the collec- 
tion frequency of any supersequence s. In what follows, we 
describe two methods that make use of the APRIORI principle 
to compute n-gram statistics in MapReduce. 

APRIORI-SCAN: The first Apriori-based method
Apriori-Scan, like the original Apriori algorithm [9]
and GSP [38], performs multiple scans over the input data.
During the k-th scan the method determines k-grams that
occur at least τ times in the document collection. To this end,
it exploits the output from the previous scan via the APRIORI
principle to prune the considered k-grams. In the k-th scan,
only those k-grams are considered whose two constituent
(k - 1)-grams are known to be frequent. Unlike GSP, which first
generates all potentially frequent sequences as candidates,
Apriori-Scan considers only sequences that actually occur
in the document collection. The method terminates after σ
scans or when a scan does not produce any output.

Algorithm 2: Apriori-Scan

int k = 1
repeat
    hashset<int[]> dict = load(output-(k - 1))

    // Mapper
    map(long did, seq d) begin
        for b = 0 to |d| - k do
            if k = 1 ∨
               (contains(dict, d[b..(b + k - 2)]) ∧
                contains(dict, d[(b + 1)..(b + k - 1)])) then
                emit(seq d[b..(b + k - 1)], long did)

    // Reducer
    reduce(seq s, list<long> l) begin
        if |l| ≥ τ then
            emit(seq s, int |l|)

    k += 1
until isEmpty(output-(k - 1)) ∨ k = σ + 1;

Algorithm 2 shows how the method can be implemented
in MapReduce. The outer repeat-loop controls the execution
of multiple MapReduce jobs, each of which performs
one distributed parallel scan over the input data. In the k-th
iteration, and thus the k-th scan of the input data, the
method considers all k-grams from an input document in
the map-function, but discards those that have a constituent
(k - 1)-gram that is known to be infrequent. This pruning is
done, leveraging the output from the previous iteration that is
kept in a dictionary. In the reduce-function, analogous to
Naive, collection frequencies of k-grams are determined and
output if at least the minimum collection frequency τ. After σ
iterations or once an iteration does not produce any output, the
method terminates, which is safe since the APRIORI principle
guarantees that no longer n-gram can occur τ or more times
in the document collection.

When applied to our running example, in its third scan of
the input data, Apriori-Scan emits in the map-phase for
every document $d_i$ only the key-value pair (⟨a x b⟩, $d_i$),
but discards other trigrams (e.g., ⟨b x x⟩) that contain an
infrequent bigram (e.g., ⟨x x⟩).
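
The scan-by-scan behaviour can be reproduced with the following single-process Python sketch. It is an illustration under our own assumptions: an in-memory set replaces the dictionary of frequent (k - 1)-grams, a Counter replaces the reduce-phase, and the function name apriori_scan is hypothetical.

```python
from collections import Counter


def apriori_scan(docs, tau, sigma):
    """Return frequent n-grams and the number of pairs emitted per scan."""
    result = {}
    frequent_prev = set()      # frequent (k-1)-grams from the previous scan
    emitted_per_scan = []
    for k in range(1, sigma + 1):
        counts = Counter()
        for d in docs:
            for b in range(len(d) - k + 1):
                s = tuple(d[b:b + k])
                # Emit only if both constituent (k-1)-grams are frequent.
                if k == 1 or (s[:-1] in frequent_prev and s[1:] in frequent_prev):
                    counts[s] += 1
        emitted_per_scan.append(sum(counts.values()))
        frequent = {s: c for s, c in counts.items() if c >= tau}
        if not frequent:
            break              # no output: longer n-grams cannot be frequent
        result.update(frequent)
        frequent_prev = set(frequent)
    return result, emitted_per_scan


docs = ["axbxx", "baxbx", "xbaxb"]  # the running example
result, emitted_per_scan = apriori_scan(docs, tau=3, sigma=3)
print(result)
print(emitted_per_scan)
```

Recording the emitted pairs per scan shows where pruning takes effect: the first two scans cannot prune anything on this input, since all unigrams are frequent, while the third scan emits only occurrences of the single surviving trigram.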

When implemented in MapReduce, every iteration corresponds
to a separate job that needs to be run and comes
with its fixed administrative cost (e.g., for launching and
finalizing the job). Another challenge in Apriori-Scan is
the implementation of the dictionary that makes the output 
from the previous iteration available and accessible to cluster 
nodes. This dictionary can either be implemented locally, so 
that every cluster node receives a replica of the previous 
iteration's output (e.g., implemented using the distributed 
cache in Hadoop), or, by loading the output from the previous 
iteration into a shared dictionary (e.g., implemented using a 
distributed key-value store) that can then be accessed remotely 
by cluster nodes. Either way, to make lookups in the dictionary
efficient, significant main memory at cluster nodes is required.

An apparent shortcoming of Apriori-Scan is that it 
has to scan the entire input data in every iteration. Thus, 
although typically only few frequent n-grams are found in 
later iterations, the cost of an iteration depends on the size of 
the input data. The number of iterations needed, on the other 
hand, is determined by the parameter σ or the length of the
longest frequent n-gram. 

In the worst case, when $\sigma \geq |d|$ and $cf(d) \geq \tau$,
Apriori-Scan emits $O(|d|^2)$ key-value pairs per document
d, each consuming $O(|d|)$ bytes, so that the method transfers
$O(|d|^3)$ bytes between the map- and reduce-phase. Again,
we provide a complementary analysis based on the actual n-gram
statistics. To this end, let

$S_{NP} = \{\, s \in S \mid \forall r \in S : (r \neq s \wedge r \sqsubseteq s) \Rightarrow cf(r) \geq \tau \,\}$

denote the set of sequences that cannot be pruned based on
the APRIORI principle, i.e., whose proper subsequences all occur
at least τ times in the document collection. Apriori-Scan
emits a total of $\sum_{s \in S_{NP} : |s| \leq \sigma} cf(s)$ key-value pairs, each of
which amounts to $O(|s|)$ bytes. Obviously, $S_{NP} \subseteq S$ holds,
so that Apriori-Scan emits at most as many key-value pairs
as Naive. Its concrete gains, though, depend on the value of
τ and characteristics of the document collection.

APRIORI-INDEX: The second Apriori-based method
Apriori-Index does not repeatedly scan the input data 
but incrementally builds an inverted index of frequent n- 
grams from the input data as a more compact representation. 
Operating on an index structure as opposed to the original data 
and considering n-grams of increasing length, it resembles 
SPADE [44] when breadth-first traversing the sequence lattice. 

Pseudo code of Apriori-Index is given in Algorithm 3.
In its first phase, the method constructs an inverted index
with positional information for all frequent n-grams up to
length K (cf. Mapper #1 and Reducer #1 in the pseudo
code). In its second phase, to identify frequent n-grams beyond
that length, Apriori-Index harnesses the output from the
previous iteration. Thus, to determine a frequent k-gram (e.g.,
⟨b a x⟩), the method joins the posting lists of its constituent
(k - 1)-grams (i.e., ⟨b a⟩ and ⟨a x⟩). In MapReduce,
this can be accomplished as follows (cf. Mapper #2 and
Reducer #2 in the pseudo code): The map-function emits
for every frequent (k - 1)-gram two key-value pairs. The
frequent (k - 1)-gram itself along with its posting list serves
in both as a value. As keys the prefix and suffix of length
(k - 2) are used. In the pseudo code, the method keeps track
of whether the key is a prefix or suffix of the sequence in
the value by using the r-seq and l-seq subtypes. The
reduce-function identifies for a specific key all compatible
sequences from the values, joins their posting lists, and emits
the resulting k-gram along with its posting list if its collection
frequency is at least τ. Two sequences are compatible and must
be joined, if one has the current key as a prefix, and the other
has it as a suffix. In its nested for-loops, the method considers
all compatible combinations of sequences. This second phase
of Apriori-Index can be seen as a distributed candidate
generation and pruning step.

Algorithm 3: Apriori-Index

int k = 1
repeat
    if k ≤ K then
        // Mapper #1
        map(long did, seq d) begin
            hashmap<seq, int[]> pos = ∅
            for b = 0 to |d| - k do
                add(get(pos, d[b..(b + k - 1)]), b)
            for seq s : keys(pos) do
                emit(seq s, posting (did, get(pos, s)))

        // Reducer #1
        reduce(seq s, list<posting> l) begin
            if cf(l) ≥ τ then
                emit(seq s, list<posting> l)
    else
        // Mapper #2
        map(seq s, list<posting> l) begin
            emit(seq s[0..|s| - 2], (r-seq, list<posting>) (s, l))
            emit(seq s[1..|s| - 1], (l-seq, list<posting>) (s, l))

        // Reducer #2
        reduce(seq s, list<(seq, list<posting>)> l) begin
            for (l-seq, list<posting>) (m, l_m) : l do
                for (r-seq, list<posting>) (n, l_n) : l do
                    list<posting> l_j = join(l_m, l_n)
                    if cf(l_j) ≥ τ then
                        seq j = m || ⟨n[|n| - 1]⟩
                        emit(seq j, list<posting> l_j)
    k += 1
until isEmpty(output-(k - 1)) ∨ k = σ + 1;

Applied to our running example and assuming K = 2, the
method only sees one pair of compatible sequences with their
posting lists for the key ⟨x⟩ in its third iteration, namely:

⟨a x⟩ : ⟨d1 : [0], d2 : [1], d3 : [2]⟩
⟨x b⟩ : ⟨d1 : [1], d2 : [2], d3 : [0, 3]⟩

By joining those, Apriori-Index obtains the only frequent
3-gram with its posting list

⟨a x b⟩ : ⟨d1 : [0], d2 : [1], d3 : [2]⟩.
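
The join itself is not spelled out in the pseudo code. One plausible reading, given here as a sketch under our own assumptions, is a positional join: since the two (k - 1)-grams overlap in k - 2 terms, an occurrence of the left sequence at position p extends to the joined k-gram whenever the right sequence occurs at position p + 1 of the same document. The representation of posting lists as dictionaries from document identifiers to position lists is likewise an assumption.

```python
def join(left, right):
    """Positional join of two posting lists (did -> sorted positions).

    Keeps position p of the left sequence in a document if the right
    sequence occurs at position p + 1 in that same document.
    """
    joined = {}
    for did, positions in left.items():
        following = set(right.get(did, ()))
        hits = [p for p in positions if p + 1 in following]
        if hits:
            joined[did] = hits
    return joined


def cf(postings):
    """Collection frequency: total number of positions in a posting list."""
    return sum(len(positions) for positions in postings.values())


# Posting lists of the two frequent bigrams from the running example.
ax = {1: [0], 2: [1], 3: [2]}
xb = {1: [1], 2: [2], 3: [0, 3]}
axb = join(ax, xb)
print(axb, cf(axb))
```

Because the join only consults posting lists, the second phase never has to revisit the documents themselves, which is the point of operating on the index.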



For all k < K, it would be enough to determine only
collection frequencies, as opposed to positional information
of n-grams. While a straightforward optimization in practice,
we opted for simpler pseudo code. When implemented as
described in Algorithm 3, the method produces an inverted
index with positional information that can be used to quickly 
determine the locations of a specific frequent n-gram. 

One challenge when implementing Apriori-Index is that 
the number and size of posting-list values seen for a specific 
key can become large in practice. Moreover, to join compatible 
sequences, these posting lists have to be buffered, and a 
scalable implementation must deal with the case when this 
is not possible in the available main memory. This can, for
instance, be accomplished by storing posting lists temporarily
in a disk-resident key-value store. 

The number of iterations needed by Apriori-Index is 
determined by the parameter σ or the length of the longest
frequent n-gram. Since every iteration, as for Apriori-Scan,
corresponds to a separate MapReduce job, a non-negligible
fixed administrative cost is incurred.

In the worst case, when $\sigma \geq |d|$ and $cf(d) \geq \tau$, Apriori-Index
emits $O(|d|^2)$ key-value pairs per document d, each
consuming $O(|d|)$ bytes, so that $O(|d|^3)$ bytes are transferred
between the map- and reduce-phase. We assume $K < \sigma$ for the
complementary analysis. In its first K iterations, Apriori-Index
emits $\sum_{s \in S : |s| \leq K} df(s)$ key-value pairs, where $df(s) \leq cf(s)$
refers to the document frequency of the n-gram s, as mentioned
in Section II. Each key-value pair consumes $O(cf(s))$
bytes. To analyze the following iterations, let

$S_F = \{\, s \in S \mid cf(s) \geq \tau \,\}$

denote the set of frequent n-grams that occur at least τ times.
Apriori-Index emits a total of

$2 \cdot |\{\, s \in S_F \mid K \leq |s| < \sigma \,\}|$

key-value pairs, each of which consumes $O(cf(s))$ bytes.
As for Apriori-Scan, the concrete gains depend on the
value of τ and characteristics of the document collection.

IV. Suffix sorting & aggregation 

As already argued, the methods presented so far suffer from 
either excessive amounts of data that need to be transferred 
and sorted, requiring possibly many MapReduce jobs, or a 
high demand for main memory at cluster nodes. Our novel 
method Suffix-σ avoids these deficiencies: It requires a single
MapReduce job, transfers only a modest amount of data, and 
requires little main memory at cluster nodes. 

Consider again what the map-function in the Naive ap- 
proach emits for document d3 from our running example. 
Emitting key-value pairs for all of the n-grams ⟨b a x⟩, ⟨b a⟩,
and ⟨b⟩ is clearly wasteful. The key observation here is that
the latter two are subsumed by the first one and can be obtained 
as its prefixes. Suffix arrays [31] and other string processing 
techniques exploit this very idea. 

Based on this observation, it is safe to emit key-value pairs 
only for a subset of the n-grams contained in a document. 
More precisely, it is enough to emit at every position in the 
document a single key-value pair with the suffix starting at 
that position as a key. These suffixes can further be truncated 
to length σ, hence the name of our method.

To determine the collection frequency of a specific n-gram
r, we have to determine how many of the suffixes emitted
in the map-phase are prefixed by r. To do so correctly
using only a single MapReduce job, we must ensure that all
relevant suffixes are seen by the same reducer. This can be
accomplished by partitioning suffixes based on their first term
only, as opposed to all terms therein. It is thus guaranteed
that a single reducer receives all suffixes that begin with the
same term. This reducer is then responsible for determining the
collection frequencies of all n-grams starting with that term.
One way to accomplish this would be to enumerate all prefixes
of a received suffix and aggregate their collection frequencies
in main memory (e.g., using a hashmap or a prefix tree). Since
it is unknown whether an n-gram is represented by other yet
unseen suffixes from the input, it cannot be emitted early along
with its collection frequency. Bookkeeping is thus needed for
many n-grams and requires significant main memory.

Algorithm 4: Suffix-σ

// Mapper
map(long did, seq d) begin
    for b = 0 to |d| - 1 do
        emit(seq d[b..min(b + σ - 1, |d| - 1)], long did)

// Reducer
stack<int> terms = ∅
stack<int> counts = ∅
reduce(seq s, list<long> l) begin
    while lcp(s, seq(terms)) < len(terms) do
        if peek(counts) ≥ τ then
            emit(seq seq(terms), int peek(counts))
        pop(terms)
        push(counts, pop(counts) + pop(counts))
    if len(terms) = |s| then
        push(counts, pop(counts) + |l|)
    else
        for i = lcp(s, seq(terms)) to |s| - 1 do
            push(terms, s[i])
            push(counts, (i == |s| - 1 ? |l| : 0))

cleanup() begin
    reduce(seq ∅, list<long> ∅)

// Partitioner
partition(seq s) begin
    return hashcode(s[0]) mod R

// Comparator
compare(seq r, seq s) begin
    for b = 0 to min(|r|, |s|) - 1 do
        if r[b] < s[b] then
            return +1
        else if r[b] > s[b] then
            return -1
    return |s| - |r|

How can we reduce the main-memory footprint and emit n- 
grams with their collection frequency early on? The key idea 
is to exploit that the order in which key-value pairs are sorted 
and received by reducers can be influenced. Suffix-σ sorts
key-value pairs in reverse lexicographic order of their suffix
key, formally defined as follows for sequences r and s:

$r < s \Leftrightarrow (|r| > |s| \wedge s \triangleright r) \vee \exists\, 0 \leq i < \min(|r|, |s|) : r[i] > s[i] \wedge \forall\, 0 \leq j < i : r[j] = s[j]$.
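
The order can be tried out with a small comparator, sketched below in Python. The function follows the definition above (larger terms first, and among sequences where one is a prefix of the other, the longer one first); the sample suffixes and the use of functools.cmp_to_key are our own choices for illustration.

```python
from functools import cmp_to_key


def compare(r, s):
    """Negative if r sorts before s in reverse lexicographic order."""
    for a, b in zip(r, s):
        if a < b:
            return +1   # smaller term sorts later
        if a > b:
            return -1   # larger term sorts earlier
    # One sequence is a prefix of the other: the longer one sorts first.
    return len(s) - len(r)


suffixes = [("b",), ("b", "a", "x"), ("b", "x"), ("b", "x", "x")]
ordered = sorted(suffixes, key=cmp_to_key(compare))
for suffix in ordered:
    print(suffix)
```

Sorting longer sequences before their own prefixes is what later allows a reducer to finalize an n-gram as soon as a suffix arrives that no longer has it as a prefix.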

To see why this is useful, recall that each suffix from 
the input represents all n-grams that can be obtained as its 
prefixes. Let s denote the current suffix from the input. The 
reverse lexicographic order guarantees that we can safely emit 
any n-gram r such that r < s, since no yet unseen suffix 
from the input can represent r. Conversely, at this point, 
the only n-grams for which we have to do bookkeeping, 



since they are represented both by the current suffix s and 
potentially by yet unseen suffixes, are the prefixes of s. We 
illustrate this observation with our running example. The 
reducer responsible for suffixes starting with b receives: 

⟨b x x⟩ : ⟨d1⟩
⟨b x⟩ : ⟨d2⟩
⟨b a x⟩ : ⟨d2, d3⟩
⟨b⟩ : ⟨d3⟩

When seeing the third suffix ⟨b a x⟩, we can immediately
finalize the collection frequency of the n-gram ⟨b x⟩ and emit
it, since no yet unseen suffix can have it as a prefix. On the
contrary, the n-grams ⟨b⟩ and ⟨b a⟩ cannot be emitted, since
yet unseen suffixes from the input may have them as a prefix.

Building on this observation, we can do efficient bookkeep- 
ing for prefixes of the current suffix s only and lazily aggregate 
their collection frequencies using two stacks. On the first stack 
terms, we keep the terms constituting s. The second stack 
counts keeps one counter per prefix of s. Between invocations 
of the reduce-function, we maintain two invariants. First, the
two stacks have the same size m. Second, $\sum_{j=i}^{m-1} counts[j]$
reflects how often the n-gram ⟨terms[0], ..., terms[i]⟩ has
been seen so far in the input. To maintain these invariants,
when processing a suffix s from the input, we first synchronously
pop elements from both stacks until the contents
of terms form a prefix of s. Before each pop operation, we
emit the contents of terms and the top element of counts,
if the latter is at least our minimum collection frequency τ.
When popping an element from counts, its value is added
to the new top element. Following that, we update terms, so
that its contents equal the suffix s. For all but the last term
added, a zero is put on counts. For the last term, we put
the frequency of s, reflected by the length of its associated
document-identifier list value, on counts. Figure 1 illustrates
how the states of the two stacks evolve, as the above example
input is processed.
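
The bookkeeping can be traced with the following Python sketch of a single reducer. It is an illustration under our own assumptions: Python lists serve as stacks, the class name SuffixSigmaReducer is hypothetical, and the sketch guards against an empty counts stack when the last element is popped, a corner case that the description above leaves implicit.

```python
def lcp(r, s):
    """Length of the longest common prefix of two sequences."""
    n = 0
    while n < min(len(r), len(s)) and r[n] == s[n]:
        n += 1
    return n


class SuffixSigmaReducer:
    def __init__(self, tau):
        self.tau = tau
        self.terms = []    # terms of the current suffix
        self.counts = []   # one lazily aggregated counter per prefix
        self.output = []   # emitted (n-gram, collection frequency) pairs

    def reduce(self, s, dids):
        # Pop until the contents of terms form a prefix of s.
        while lcp(s, self.terms) < len(self.terms):
            if self.counts[-1] >= self.tau:
                self.output.append((tuple(self.terms), self.counts[-1]))
            self.terms.pop()
            top = self.counts.pop()
            if self.counts:            # add popped count to the new top
                self.counts[-1] += top
        if len(self.terms) == len(s):
            if self.counts:            # s itself is already on the stack
                self.counts[-1] += len(dids)
        else:
            for i in range(lcp(s, self.terms), len(s)):
                self.terms.append(s[i])
                self.counts.append(len(dids) if i == len(s) - 1 else 0)

    def cleanup(self):
        """Flush the stacks by processing an empty suffix."""
        self.reduce((), [])


# Input of the reducer responsible for suffixes starting with b (tau = 3).
reducer = SuffixSigmaReducer(tau=3)
for s, dids in [(("b", "x", "x"), [1]), (("b", "x"), [2]),
                (("b", "a", "x"), [2, 3]), (("b",), [3])]:
    reducer.reduce(s, dids)
reducer.cleanup()
print(reducer.output)
```

At any time the stacks hold at most as many entries as the current suffix has terms, so the memory needed per reducer is bounded by σ rather than by the number of distinct n-grams.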
Fig. 1. Suffix-σ's bookkeeping illustrated



Pseudo code of Suffix-σ is given in Algorithm 4. The map-function
emits for every document all its suffixes truncated
to length σ if possible. The reduce-function reads suffixes
in reverse lexicographic order and performs the bookkeeping
using two separate stacks for n-grams (terms) and their
collection frequencies (counts), as described above. The function
seq() returns the n-gram corresponding to the entire
terms stack. The function lcp() returns the length of the
longest common prefix that two n-grams share. In addition,
Algorithm 4 contains a partition-function ensuring that
suffixes are assigned to one of R reducers solely based on
their first term, as well as a compare-function that ensures
the reverse lexicographic order of input suffixes in the map-phase.
When implemented in Hadoop, these two functions
would materialize as a custom partitioner class and a custom
comparator class. Finally, cleanup() is a method invoked
once, when all input has been seen.

Suffix-σ emits $O(|d|)$ key-value pairs per document d.
Each of these key-value pairs consumes $O(|d|)$ bytes in the
worst case when $\sigma \geq |d|$. The method thus transfers $O(|d|^2)$
bytes between the map- and reduce-phase. For every term
occurrence in the document collection, Suffix-σ emits exactly
one key-value pair, so that in total $\sum_{s \in S : |s| = 1} cf(s)$ key-value
pairs are emitted, each consuming $O(\sigma)$ bytes.
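
For the running example with σ = 3, these totals can be worked out by hand. The following lines are a worked sketch that uses only the collection frequencies of the example documents, where the 15 term occurrences give rise to 12 bigram and 9 trigram occurrences.

```latex
% Suffix-\sigma: one key-value pair per term occurrence
\sum_{s \in S : |s| = 1} cf(s)
  \;=\; cf(\langle a \rangle) + cf(\langle b \rangle) + cf(\langle x \rangle)
  \;=\; 3 + 5 + 7 \;=\; 15

% Naive: one key-value pair per occurrence of an n-gram with |s| \leq 3
\sum_{s \in S : |s| \leq 3} cf(s)
  \;=\; \underbrace{15}_{|s| = 1} + \underbrace{12}_{|s| = 2} + \underbrace{9}_{|s| = 3}
  \;=\; 36
```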

V. Efficient implementation 

Having described the different methods at a conceptual 
level, we now provide details on aspects of their imple- 
mentation, which we found to have a significant impact on 
performance in practice: 

Document Splits. Collection frequencies of individual 
terms (i.e., unigrams) can be exploited to drastically reduce 
required work by splitting up every document at infrequent 
terms that it contains. Thus, assuming that z is an infrequent 
term given the current value of τ, we can split up a document
like ⟨c b a z b a c⟩ into the two shorter sequences ⟨c b a⟩
and ⟨b a c⟩. Again, this is safe due to the APRIORI principle,
since no frequent n-gram can contain z. All methods profit
from this, for large values of σ in particular.
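
A document split of this kind takes only a few lines, as the following Python sketch shows. The function name split_at_infrequent and the representation of the frequent terms as a set are our own assumptions; in practice the set would come from a preceding unigram count.

```python
def split_at_infrequent(doc, frequent_terms):
    """Split a document into maximal runs that contain only frequent terms."""
    pieces, current = [], []
    for term in doc:
        if term in frequent_terms:
            current.append(term)
        elif current:
            # An infrequent term ends the current run and is dropped.
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


# The example from the text: z is assumed to be infrequent.
print(split_at_infrequent(list("cbazbac"), {"a", "b", "c"}))
```

Consecutive infrequent terms produce no empty pieces, so a document consisting only of infrequent terms simply disappears from the input.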

Sequence Encoding. It is inefficient to operate on doc- 
uments in a textual representation. As a one-time prepro- 
cessing, we therefore convert our document collections, so 
that they are represented as a dictionary, mapping terms to 
term identifiers, and one integer term-identifier sequence for 
every document. We assign identifiers to terms in descending 
order of their collection frequency to optimize compression. 
From there on, our implementation internally only deals with 
arrays of integers. Whenever serialized for transmission or 
storage, these are compactly represented using variable-byte 
encoding [42]. This also speeds up sorting, since n-grams 
can now be compared using integer operations as opposed to 
operations on strings, thus requiring generally fewer machine 
instructions. Compact sequence encoding benefits all methods 
- in particular Apriori-Scan with its repeated scans of the 
document collection. 
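
The following Python sketch illustrates the two ingredients of this encoding: term identifiers assigned in descending order of collection frequency, and a variable-byte code for integer sequences. It is a sketch under our own assumptions. Several variable-byte variants exist, and the one shown here (seven payload bits per byte, least significant group first, high bit marking the last byte of a number) is not necessarily the variant used in our implementation.

```python
from collections import Counter


def assign_term_ids(docs):
    """Give the most frequent term identifier 0, the next one 1, and so on."""
    cf = Counter(term for d in docs for term in d)
    ranked = sorted(cf, key=lambda t: (-cf[t], t))  # ties broken by term
    return {term: tid for tid, term in enumerate(ranked)}


def vbyte_encode(numbers):
    """Encode non-negative integers using seven payload bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # lower seven bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # high bit marks the last byte
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


docs = ["axbxx", "baxbx", "xbaxb"]
ids = assign_term_ids(docs)
encoded = vbyte_encode([ids[t] for t in docs[0]])
print(ids, len(encoded))
```

Because frequent terms receive small identifiers, most term occurrences fit into a single byte, which is why the frequency-based assignment and the variable-byte code work well together.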

Key-Value Store. For Apriori-Scan and Apriori-Index,
reducers potentially buffer a lot of data, namely, the
dictionary of frequent (k − 1)-grams or the set of posting
lists to be joined. Our implementation keeps this data in
main memory as long as possible. Otherwise, it migrates the
data into a disk-resident key-value store (Berkeley DB Java
Edition [3]). Most main memory is then used for caching,
which helps Apriori-Scan in particular, since lookups of
frequent (k − 1)-grams typically hit the cache.

Hadoop-Specific Optimizations that we use in our
implementation include local aggregation (cf. Mapper #1 in
Algorithm 3), Hadoop's distributed cache facility, raw comparators
to avoid deserialization and object instantiation, as well as
other best practices (e.g., described in [41]).

How easy to implement are the methods presented in previous
sections? While hard to evaluate systematically, we still
want to address this question based on our own experience.
Naive is the clear winner here. Implementations of the
Apriori-based methods, as explained in Section III, require
various tweaks (e.g., the use of a key-value store) to make
them work. Suffix-σ does not require any of those and,
when Hadoop is used as a MapReduce implementation, can
be implemented using only on-board functionality.

VI. Extensions 

In this section, we describe how Suffix-σ can be extended
to consider only maximal/closed n-grams and thus produce a
more compact result. Moreover, we explain how it can support
aggregations beyond occurrence counting, using n-gram time
series, recently made popular by [32], as an example.

A. Maximality & Closedness

The number of n-grams that occur at least τ times in the
document collection can be huge in practice. To reduce it, we
can adopt the notions of maximality and closedness common
in frequent pattern mining. Formally, an n-gram r is maximal,
if there is no n-gram s such that r ⋄ s (i.e., r is a
subsequence of s) and cf(s) ≥ τ. Similarly, an n-gram r is
closed, if no n-gram s exists such that r ⋄ s and
cf(r) = cf(s) ≥ τ. The sets of maximal or closed n-grams
are subsets of all n-grams that occur at least τ times. Omitted
n-grams can be reconstructed, for closedness even with their
accurate collection frequency.

Suffix-σ can be extended to produce maximal or closed
n-grams. Recall that, in its reduce-function, our method
processes suffixes in reverse lexicographic order. Let r denote
the last n-gram emitted. For maximality, we only emit the next
n-gram s, if it is no prefix of r (i.e., ¬(s ▷ r)). For closedness,
we only emit s, if it is no prefix of r or if it has a different
collection frequency (i.e., ¬(s ▷ r ∧ cf(s) = cf(r))). In our
example, the reducer responsible for term a receives

( a x b ) : ( d1, d2, d3 )

and, both for maximality and closedness, emits only the
n-gram ( a x b ) but none of its prefixes. With this extension,
we thus emit only prefix-maximal or prefix-closed n-grams,
whose formal definitions are analogous to those of maximality
and closedness above, but replace ⋄ by ▷. In our example,
we still emit ( x b ) and ( b ) on the reducers responsible for
terms x and b, respectively. For maximality, as subsequences
of ( a x b ), these n-grams must be omitted. We achieve this
by means of an additional post-filtering MapReduce job. As
input, the job consumes the output produced by Suffix-σ
with the above extensions. In its map-function, n-grams are
reversed (e.g., ( a x b ) becomes ( b x a )). These reversed
n-grams are partitioned based on their first term and sorted in
reverse lexicographic order, reusing ideas from Suffix-σ. In
the reduce-function, we apply the same filtering as described
above to keep only prefix-maximal or prefix-closed reversed
n-grams. Before emitting a reversed n-gram, we restore its
original order by reversing it. In our example, the reducer
responsible for b receives

( b x a ) : 3
( b x ) : 4
( b ) : 5

and, for maximality, only emits ( a x b ). In summary, we
obtain maximal or closed n-grams by first determining
prefix-maximal or prefix-closed n-grams and, after that, identifying
the suffix-maximal or suffix-closed among them.
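
The two-stage filtering can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the MapReduce jobs themselves: the frequent n-grams with their collection frequencies are assumed to be available as a dictionary, a global descending sort stands in for partitioning by first term plus reverse lexicographic sorting, and the names `prefix_filter` and `maximal_or_closed` are hypothetical. The counts for ( a x b ), ( x b ), and ( b ) follow the example above; the remaining counts are assumed for illustration.

```python
def is_prefix(s, r):
    """True if n-gram s is a prefix of n-gram r (both tuples of terms)."""
    return len(s) <= len(r) and r[:len(s)] == s

def prefix_filter(ngrams, closed=False):
    """Keep only prefix-maximal (or prefix-closed) n-grams.

    ngrams maps term tuples to collection frequencies. Sorting tuples in
    descending order places every n-gram after all of its extensions.
    """
    kept, last = {}, None
    for s in sorted(ngrams, reverse=True):
        redundant = last is not None and is_prefix(s, last)
        if closed and redundant:
            # For closedness, a prefix is dropped only if its frequency matches.
            redundant = ngrams[s] == ngrams[last]
        if not redundant:
            kept[s] = ngrams[s]
            last = s
    return kept

def maximal_or_closed(ngrams, closed=False):
    """Prefix filtering, then the same filtering on reversed n-grams."""
    stage1 = prefix_filter(ngrams, closed)
    reversed_ngrams = {s[::-1]: cf for s, cf in stage1.items()}
    stage2 = prefix_filter(reversed_ngrams, closed)
    return {s[::-1]: cf for s, cf in stage2.items()}

# Frequencies partly assumed for illustration (see lead-in).
frequent = {
    ("a",): 3, ("b",): 5, ("x",): 7,
    ("a", "x"): 3, ("x", "b"): 4, ("a", "x", "b"): 3,
}
print(maximal_or_closed(frequent))
print(maximal_or_closed(frequent, closed=True))
```

Comparing only against the last emitted n-gram suffices here because, in descending order, all extensions of an n-gram form a contiguous run directly before it.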

B. Beyond Occurrence Counting 

Our focus so far has been on determining collection fre- 
quencies of n-grams, i.e., counting their occurrences in the 
document collection. One can move beyond occurrence count- 
ing and aggregate other information about n-grams, e.g.: 

• build an inverted index that records for every n-gram how 
often or where it occurs in individual documents; 

• compute statistics based on meta-data of documents (e.g.,
timestamp or location) that contain an n-gram.

In the following, we concentrate on the second type of
aggregation and, as a concrete instance, consider the computation
of n-gram time series. Here, the objective is to determine for
every n-gram a time series whose observations reveal how
often the n-gram occurs in documents published, e.g., in a
specific year. Suffix-σ can be extended to produce such
n-gram time series as follows: In the map-function we emit every
suffix along with the document identifier and its associated
timestamp. In the reduce-function, the counts stack is
replaced by a stack of time series, which we aggregate lazily.
When popping an element from the stack, instead of adding
counts, we add time series observations. In the same manner,
we can compute other statistics based on the occurrences of
an n-gram in documents and their associated meta-data. While
these could also be computed by an extension of Naive, the
benefit of using Suffix-σ is that the required document
meta-data is transferred only once per suffix of a document, as
opposed to once per contained n-gram.
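
To make the stack-of-time-series idea concrete, the following is a minimal single-process sketch. It simulates map, shuffle, and reduce in memory, attaches only a publication year to each suffix (document identifiers are omitted), and reconstructs the stack handling from the description above, so details may differ from the actual reducer. The function name `ngram_time_series` and the toy documents with their years are assumptions.

```python
from collections import Counter

def ngram_time_series(docs, tau, sigma):
    """docs: list of (year, terms). Returns {ngram: {year: occurrences}}
    for all n-grams of length <= sigma occurring at least tau times."""
    # Map: one record per suffix (truncated to sigma terms) with its year.
    records = [(tuple(terms[i:i + sigma]), year)
               for year, terms in docs for i in range(len(terms))]
    # Shuffle: reverse lexicographic order, extensions before their prefixes.
    records.sort(key=lambda rec: rec[0], reverse=True)

    terms_stack, series_stack, result = [], [], {}

    def pop():
        ngram = tuple(terms_stack)
        series = series_stack.pop()
        terms_stack.pop()
        if sum(series.values()) >= tau:
            result[ngram] = dict(series)
        if series_stack:
            # Lazy aggregation: pass observations on to the shorter prefix.
            series_stack[-1].update(series)

    # Reduce: keep the current suffix on the stacks.
    for suffix, year in records:
        lcp = 0
        while (lcp < min(len(suffix), len(terms_stack))
               and suffix[lcp] == terms_stack[lcp]):
            lcp += 1
        while len(terms_stack) > lcp:
            pop()
        for term in suffix[len(terms_stack):]:
            terms_stack.append(term)
            series_stack.append(Counter())
        series_stack[-1][year] += 1
    while terms_stack:
        pop()
    return result

# Toy documents with assumed publication years.
docs = [(1987, list("axbxx")), (1987, list("baxbx")), (1988, list("xbaxb"))]
for ngram, series in sorted(ngram_time_series(docs, tau=3, sigma=3).items()):
    print(ngram, series)
```

Each map record carries the year exactly once, no matter how many n-grams the suffix contributes to, which mirrors the transfer argument made above.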

VII. Experimental evaluation 

We conducted comprehensive experiments to compare the 
different methods and understand their relative benefits and 
trade-offs. Our findings from these experiments are the subject 
of this section. 

A. Setup & Implementation 

Cluster Setup. All experiments were run on a local cluster
consisting of ten Dell R410 server-class computers, each
equipped with 64 GB of main memory, two Intel Xeon X5650
6-core CPUs, and four internal 2 TB SAS 7,200 rpm hard
disks configured as a bunch-of-disks. Debian GNU/Linux 5.0.9
(Lenny) was used as an operating system. Machines in the
cluster are connected via 1 GBit Ethernet. We use Cloudera
CDH3u0 as a distribution of Hadoop 0.20.2 running on Oracle
Java 1.6.0_26. One of the machines acts as master and runs
Hadoop's namenode and jobtracker; the other nine machines
are configured to run up to ten map tasks and ten reduce
tasks in parallel. To restrict the number of map/reduce slots,
we employ a capacity-constrained scheduler pool in Hadoop.
When we state that n map/reduce slots are used, our cluster
executes up to n map tasks and n reduce tasks in parallel.
Java virtual machines to process tasks are always launched
with 4 GB heap space.



TABLE I
Dataset characteristics

                            NYT              CW
# documents                 1,830,592        50,221,915
# term occurrences          1,049,440,645    21,404,321,682
# distinct terms            345,827          979,935
# sentences                 55,362,552       1,257,357,167
sentence length (mean)      18.96            17.02
sentence length (stddev)    14.05            17.56



Implementation. All methods are implemented in Java
(JDK 1.6) applying the optimizations described in Section V
to the extent possible and sensible for each of them.

Methods. We compare the methods Naive, Apriori-Scan,
Apriori-Index, and Suffix-σ in our experiments.
For Apriori-Index, we set K = 4, so that the method
directly computes collection frequencies of n-grams having
length four or less. We found this to be the best-performing
parameter setting in a series of calibration experiments.

Measures. For our experiments in the following, we report 
as performance measures: 

(a) wallclock time as the total time elapsed between launch- 
ing a method and receiving the final result (possibly 
involving multiple Hadoop jobs), 

(b) bytes transferred as the total amount of data transferred
between map- and reduce-phase(s) (obtained from
Hadoop's MAP_OUTPUT_BYTES counter),

(c) # records as the total number of key-value pairs
transferred and sorted between map- and reduce-phase(s)
(obtained from Hadoop's MAP_OUTPUT_RECORDS
counter).

For Apriori-Scan and Apriori-Index, measures (b) and 
(c) are aggregates over all Hadoop jobs launched. All measure- 
ments reported are based on single runs and were performed 
with exclusive access to the Hadoop cluster, i.e., without 
concurrent activity by other jobs, services, or users. 

B. Datasets 

We use two publicly-available real-world datasets for our 
experiments, namely: 

• The New York Times Annotated Corpus [7] consisting 
of more than 1.8 million newspaper articles from the 
period 1987-2007 (NYT); 

• ClueWeb09-B [6], as a well-defined subset of the 
ClueWeb09 corpus of web documents, consisting of more 
than 50 million web documents in English language that 
were crawled in 2009 (CW). 

These two are extremes: NYT is a well-curated, relatively 
clean, longitudinal corpus, i.e., documents therein have a clear 
structure, use proper language with few typos, and cover 
a long time period. CW is a "World Wild Web" corpus, 
i.e., documents therein are highly heterogeneous in structure, 
content, and language. 

For NYT, a document consists of the newspaper article's
title and body. To make CW more manageable, we use
boilerplate detection as described by Kohlschütter et al. [25] and
implemented in boilerpipe's [4] default extractor, to identify
the core content of documents. On both datasets, we use
OpenNLP [2] to detect sentence boundaries in documents.
Sentence boundaries act as barriers, i.e., we do not consider
n-grams that span across sentences in our experiments. As
described in Section V, in a pre-processing step, we convert
both datasets into sequences of integer term-identifiers. The
term dictionary is kept as a single text file; documents are
spread as key-value pairs of 64-bit document identifier and
content integer array over a total of 256 binary files. Table I
summarizes characteristics of the two datasets.

C. Output Characteristics 

Let us first look at the n-gram statistics that (or, parts of
which) we expect as output from all methods. To this end,
for both document collections, we determine all n-grams that
occur at least five times (i.e., τ = 5 and σ = ∞). We bin
n-grams into 2-dimensional buckets of exponential width, i.e.,
the n-gram s with collection frequency cf(s) goes into bucket
(i, j) where i = ⌊log10 |s|⌋ and j = ⌊log10 cf(s)⌋. Figure 2
reports the number of n-grams per bucket.
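
The bucketing can be written down as a small sketch. The function names and the toy input are assumptions; the floor of the base-10 logarithm is computed from the number of decimal digits to sidestep floating-point rounding at exact powers of ten.

```python
from collections import Counter

def log10_floor(n):
    """floor(log10(n)) for a positive integer n, via its digit count."""
    return len(str(n)) - 1

def bucket_counts(ngram_cf):
    """Count n-grams per bucket (i, j), with i derived from the n-gram
    length |s| and j from its collection frequency cf(s)."""
    buckets = Counter()
    for ngram, cf in ngram_cf.items():
        buckets[(log10_floor(len(ngram)), log10_floor(cf))] += 1
    return buckets

# Toy statistics (assumed): n-grams as tuples of term ids with their cf.
stats = {
    (1,): 7,
    (1, 2): 12,
    tuple(range(10)): 5,
    tuple(range(150)): 1000,
}
print(sorted(bucket_counts(stats).items()))
```

Under this scheme, bucket (2, 1) would hold n-grams of 100 to 999 terms that occur between 10 and 99 times.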

The figure reveals that the distribution is biased toward short
and less frequent n-grams. Consequently, as we lower the
value of τ, all methods have to deal with a drastically
increasing number of n-grams. What can also be seen from Figure 2
is that, in both datasets, n-grams exist that are very long,
containing a hundred or more terms, and occur more than ten
times in the document collection. Examples of long n-grams
that we see in the output include ingredient lists of recipes
(e.g., ...1 tablespoon cooking oil ...) and chess
openings (e.g., e4 e5 2 nf3...) in NYT; in CW they include
web spam (e.g., travel tips san miguel tourism san
miguel transport san miguel ...) as well as error
messages and stack traces from web servers and other software
(e.g., ...php on line 91 warning...) that also occur
within user discussions in forums. For the Apriori-based
methods, such long n-grams are unfavorable, since they
require many iterations to identify them.

D. Use Cases 

As a first experiment, we investigate how the methods
perform for parameter settings chosen to reflect two typical use
cases, namely, training a language model and text analytics.
For the first use case, we set τ = 10 on NYT and τ = 100
on CW, as relatively low minimum collection frequencies, in
combination with σ = 5. The n-gram statistics made public by
Google [5], as a comparison, were computed with parameter
settings τ = 40 and σ = 5 on parts of the Web. For the
second use case, we choose σ = 100, as a relatively high
maximum sequence length, combined with τ = 100 on NYT
and τ = 1,000 on CW. The idea in the analytics use case
is to identify recurring fragments of text (e.g., quotations or
idioms) to be analyzed further (e.g., their spread over time).

Figure 3 reports wallclock-time measurements obtained
for these two use cases with 64 map/reduce slots. For our
language-model use case, Suffix-σ outperforms Apriori-Scan
as the best competitor by a factor of 3x on both datasets.
For our analytics use case, we see a factor of 12x improvement
over Apriori-Index as the best competitor on NYT; on CW
Suffix-σ still outperforms the next best Apriori-Scan by a
factor of 1.5x. Measurements for Naive on CW in Figure 3 are
missing, since the method did not complete in reasonable time.

Fig. 2. Output characteristics as # of n-grams s with cf(s) ≥ 5 per n-gram length and collection frequency

Fig. 3. Wallclock times in minutes for (a) training a language model ...

E. Varying Minimum Collection Frequency 

Our second experiment studies how the methods behave
as we vary the minimum collection frequency τ. We use a
maximum length σ = 5 and apply all methods to the entire
datasets. Measurements are performed using 64 map/reduce
slots and reported in Figure 4.

We observe that for high minimum collection frequencies,
Suffix-σ performs as well as the best competitor
Apriori-Scan. For low minimum collection frequencies, it
significantly outperforms the other methods. Both Apriori-based
methods show steep increases in wallclock time as we lower
the minimum collection frequency, especially when we reach
the lowest value of τ on each document collection. This is
natural, because for both methods the work that has to be done
in the k-th iteration depends on the number of (k − 1)-grams
output in the previous iteration, which have to be joined or
kept in a dictionary, as described in Section III. As observed
in Figure 2 above, the number of k-grams grows drastically
as we decrease the value of τ. When looking at the number of
bytes and the number of records transferred, we see analogous
behavior. For low values of τ, Suffix-σ transfers significantly
less data than its competitors.

F. Varying Maximum Length 

In this third experiment, we study the methods' behavior
as we vary the maximum length σ. The minimum collection
frequency is set as τ = 100 for NYT and τ = 1,000 for CW
to reflect their different scale. Measurements are performed on
the entire datasets with 64 map/reduce slots and reported in
Figure 5. Measurements for σ > 5 are missing for Naive on
CW, since the method did not finish within reasonable time
for those parameter settings.

Suffix-σ is on par with the best-performing competitor
on CW, when considering n-grams of length up to 50. For
σ = 100, it outperforms the next best Apriori-Scan by
a factor of 1.5x. On NYT, Suffix-σ consistently outperforms
all competitors by a wide margin. When we increase the
value of σ, the Apriori-based methods need to run more
Hadoop jobs, so that their wallclock times keep increasing.
For Naive and Suffix-σ, on the other hand, we observe a
saturation of wallclock times. This is expected, since these
methods have to do additional work only for input sequences
longer than σ consisting of terms that occur at least τ times
in the document collection. When looking at the number of
bytes and the number of records transferred, we observe a
saturation for Naive for the reason mentioned above. For
Suffix-σ, only the number of bytes saturates; the number of
records transferred is constant, since it depends only on the
minimum collection frequency τ. Further, we see that
Suffix-σ consistently transfers the fewest records.

G. Scaling the Datasets 

Next, we investigate how the methods react to changes in
the scale of the datasets. To this end, both from NYT and
CW, we extract smaller datasets that contain a random 25%,
50%, or 75% subset of the documents. Again, the minimum
collection frequency is set as τ = 100 for NYT and τ = 1,000
for CW. The maximum length is set as σ = 5. Wallclock times
are measured using 64 map/reduce slots.

From Figure 6 we observe that Naive handles additional
data equally well on both datasets. The other methods'
scalability is comparable to that of Naive on CW, as can be
seen from their almost-identical slopes. On NYT, in
contrast, Apriori-Scan, Apriori-Index, and Suffix-σ cope
slightly better with additional data than Naive. This is due to
the different characteristics of the two datasets.

H. Scaling Computational Resources

Our final experiment explores how the methods behave as 
we scale computational resources. Again, we set τ = 100 for
NYT and τ = 1,000 for CW. All methods are applied to
the 50% samples of documents from the collections. We vary 
the number of map/reduce slots as 16, 32, 48, and 64. The 
number of cluster nodes remains constant in this experiment, 
since we cannot add/remove machines to/from the cluster due 
to organizational restrictions. We thus only vary the amount 
of parallel work every machine can do; their total number 
remains constant throughout this experiment. 

We observe from Figure 7 that all methods show comparable
behavior as we make additional computational resources
available. Or, put differently, all methods make equally effective
use of them. What can also be observed across all methods
is that the gains of adding more computational resources are
diminishing, because mappers and reducers compete
for shared devices such as hard disks and network interfaces.
This phenomenon is more pronounced on NYT than CW, since
methods take generally less time on the smaller dataset, so that
competition for shared devices is fiercer and has no chance to
level out over time.

I. Summary

What we see in our experiments is that Suffix-σ
outperforms its competitors when long and/or less frequent
n-grams are considered. Even otherwise, when the focus is
on short and/or very frequent n-grams, Suffix-σ never
performs significantly worse than the other methods. It is hence
robust and can handle a wide variety of parameter choices.
To substantiate this, consider that Suffix-σ could compute
statistics about arbitrary-length n-grams that occur at least five
times (i.e., τ = 5 and σ = ∞), as reported in Figure 2, in less
than six minutes on NYT and six hours on CW.

VIII. Related Work 

We now discuss the connection between this work and 
existing literature, which can broadly be categorized into: 

Frequent Pattern Mining goes back to the seminal work 
by Agrawal et al. [8] on identifying frequent itemsets in 
customer transactions. While the APRIORI algorithm described 
therein follows a candidate generation & pruning approach, 
Han et al. [20] have advocated pattern growth as an alternative 
approach. To identify frequent sequences, which is a problem
closer to our work, the same kinds of approaches can be used.
Agrawal and Srikant [10], [38] describe candidate generation 
& pruning approaches; Pei et al. [37] propose a pattern-growth 
approach. SPADE by Zaki [44] also generates and prunes 
candidates but operates on an index structure as opposed 
to the original data. Parallel methods for frequent pattern 
mining have been devised both for distributed-memory [19] 
and shared-memory machines [36], [45]. Little work exists 
that assumes MapReduce as a model of computation. Li et 
al. [26] describe a pattern-growth approach to mine frequent 
itemsets in MapReduce. Huang et al. [22] sketch an ap- 
proach to maintain frequent sequences while sequences in 
the database evolve. Their approach is not applicable in our 
setting, since it expects input sequences to be aligned (e.g.,
based on time) and only supports document frequency. For 
more detailed discussions, we refer to Ceglar and Roddick [14] 
for frequent itemset mining, Mabroukeh and Ezeife [30] for 
frequent sequence mining, and Han et al. [21] for frequent 
pattern mining in general. 

Natural Language Processing & Information Retrieval.
Given their role in NLP, multiple efforts [11], [15], [18], [23],
[39] have looked into n-gram statistics computation. While 
these approaches typically consider document collections of 
modest size, recently Lin et al. [27] and Nguyen et al. [34] 
targeted web-scale data. Among the aforementioned work, 
Huston et al. [23] is closest to ours, also focusing on less 
frequent n-grams and using a cluster of machines. However, 
they only consider n-grams consisting of up to eleven words 
and do not provide details on how their methods can be 
adapted to MapReduce. Yamamoto and Church [43] augment 
suffix arrays, so that the collection frequency of substrings in 
a document collection can be determined efficiently. Bernstein 
and Zobel [12] identify long n-grams as a means to spot co- 
derivative documents. Brants et al. [13] and Wang et al. [40] 
describe the n-gram statistics made available by Google and
Microsoft, respectively. Zhai [46] gives details on the use 
of n-gram statistics in language models. Michel et al. [32] 
demonstrated recently that n-gram time series are powerful
tools to understand the evolution of culture and language. 

MapReduce Algorithms. Several efforts have looked into
how specific problems can be solved using MapReduce,
including all-pairs document similarity [28], processing
relational joins [35], coverage problems [16], and content
matching [33]. However, no existing work has specifically
addressed computing n-gram statistics in MapReduce.

Fig. 7. Scaling computational resources: wallclock times on (a) NYT and (b) CW by # of slots, for Apriori-Scan, Apriori-Index, Naive, and Suffix-σ

IX. Conclusions 

In this work, we have presented Suffix-σ, a novel method
to compute n-gram statistics using MapReduce as a platform
for distributed data processing. Our evaluation on two
real-world datasets demonstrated that Suffix-σ outperforms
MapReduce adaptations of Apriori-based methods significantly,
in particular when long and/or less frequent n-grams are
considered. Otherwise, Suffix-σ is robust, performing at least on
par with the best competitor. We also argued that our method is
easier to implement than its competitors, having been designed
with MapReduce in mind. Finally, we established our method's
versatility by showing that it can be extended to produce
maximal/closed n-grams and perform aggregations beyond
occurrence counting.